Detecting Tables in HTML Documents
Abstract
Tables are a commonly used presentation scheme, especially for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <TABLE> elements, the <TABLE> tag is often used liberally to achieve multi-column layout effects rather than to present relational information. In other words, a <TABLE> tag does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of HTML files collected from hundreds of different web sites from various domains and contains leaf <TABLE> elements, only some of which are genuine tables. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system, with performance reported as an F-measure.
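As a rough illustration of the ideas in the abstract, the sketch below finds leaf <TABLE> elements (tables that contain no nested table), derives a few simple layout features from them, and computes an F-measure from predicted and true labels. The class and function names (LeafTableCollector, layout_features, f_measure) and the three features are assumptions made for this example. They are not the feature set or classifier used in the paper, and no learning step is included.

```python
from html.parser import HTMLParser
from statistics import mean, pstdev


class LeafTableCollector(HTMLParser):
    """Collect the number of cells in each row of every leaf <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per currently open <table>
        self.leaves = []  # per-row cell counts of each finished leaf table

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            if self.stack:
                # The enclosing table contains a table, so it is not a leaf.
                self.stack[-1]["leaf"] = False
            self.stack.append({"leaf": True, "rows": []})
        elif tag == "tr" and self.stack:
            self.stack[-1]["rows"].append(0)
        elif tag in ("td", "th") and self.stack and self.stack[-1]["rows"]:
            self.stack[-1]["rows"][-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            table = self.stack.pop()
            if table["leaf"]:
                self.leaves.append(table["rows"])


def layout_features(rows):
    """Hypothetical layout features: row count, mean and spread of cells per row."""
    if not rows:
        return {"n_rows": 0, "mean_cells": 0.0, "std_cells": 0.0}
    return {
        "n_rows": len(rows),
        "mean_cells": mean(rows),
        "std_cells": pstdev(rows),  # low spread suggests a regular grid
    }


def f_measure(predicted, actual):
    """Balanced F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

To use it, create a LeafTableCollector, call its feed method with the HTML text, and pass each entry of its leaves list to layout_features. The resulting feature dictionaries could then be handed to any trainable classifier and scored with f_measure under cross validation.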
Similar Resources
Understanding Tables on the Web
The Web contains a wealth of information, and a key challenge is to make this information machine processable. Because natural language understanding at web scale remains difficult and costly at present, in this paper we focus our attention on understanding well-structured HTML tables on the Web. From 0.3 billion Web documents, we obtain 1.95 billion tables, and 0.5-1% of these contain meaning...
DETECTING SIMILAR HTML DOCUMENTS USING A SENTENCE-BASED COPY DETECTION APPROACH
Rajiv Yerra, Department of Computer Science, Master of Science. Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which takes longer to filter for unique information and cause additional s...
HTML Tag Based Metrics for use in Web Page Type Classification
Traditional machine learning classifications of HTML documents focus on features drawn from terms in the documents, the link structure of groups of documents, or a combination of both. These techniques attempt to generate topical classifications of documents, with the hopes of mirroring a human's classification of pages into subject areas, thus facilitating retrieval. This paper presents an alt...
Application of Radon Transform in Detecting Turning Angle of Bodies and in Reading Multi-Lingual Documents
Recently, image processing techniques and robotic vision have been widely applied in fault detection of industrial products as well as in document reading. In order to compare the captured images from the target, it is necessary to prepare a perfect image, and then matching should be applied. Preprocessing must therefore be done to correct the sample's and/or camera's movement which can occur during the...
On Table Extraction from Text Sources with Markups
Table extraction is the task of locating tables in documents and extracting their entries along with the arrangement of the entries inside the tables. The notion of tables applied in this work excludes any sort of metadata, i.e., only the content elements of the tables are to be extracted. We follow a simple unsupervised approach by selecting the tables according to a score that measures the in...